SOME STATISTICAL ISSUES REGARDING THE ESTIMATION OF / NL IN CMB 

NON-GAUSSIANITY 

L. TENORIO* 

Abstract. We consider the problem of estimating the parameter /nl m the standard local model of primordial CMB 
non-Gaussianity. We determine the properties of maximum likelihood (ML) estimates and show that the problem is not 
the typical ML estimation problem as there are subtle issues involved. In particular, The Cramer-Rao inequality is not 
applicable and the likelihood function is unbounded with several points of maxima. However, the particular characteristics 
of the likelihood function lead to ML estimates that are simple and easy to compute. We compare their performance to 
that of moment estimators. We find that ML is better than the latter for values /nl away from the origin. For small 
values of /nl 5 the fifth order moment is better than ML and the other moment estimators. However, we show how for 
small /nh one can easily improve the estimators by a simple shrinkage procedure. This is clearly important when the goal 
is to estimate a very small /nl. In the process of studying the inference problem, we address some basic issues regarding 
statistical estimation in general that were raised at the Workshop on Non-Gaussianity in Cosmology held in Trieste in 
July 2006. 

1. Introduction. We consider the standard local model of primordial CMB non-Gaussianity 
where sky maps are modeled as 

Y^Ut + f^iUf-a 2 ), (1.1) 

with U = (U±, U n ) a zero-mean Gaussian vector with covariance matrix S and Var([/i) = a 2 . The 
covariance matrix is defined by the covariance function of the random field. The objective is to use the 
data vector Y = (Fi, Y n ) to estimate /nl- 

As opposed to the spectrum/moment estimators considered by Creminelli et al. [3] and Babich 
PP, we will also consider maximum likelihood and shrinkage versions of the estimators. In addition, we 
will focus on studying their performance under the assumption that the leading order model is correct. 
That is, while these authors considered perturbation models of (|1.1|) . we will assume the local model 
is correct. The reason is that to understand properties of perturbations, it is important to understand 
properties under the ideal case where the leading order model is correct and standard statistical tools 
can be used. Extending the statistical methods to perturbation models is much more difficult; it is not 
clear to this author what methods will remain valid and to what degree. 

To compare estimators we will mainly use the mean square error so as to take into account bias and 
variance. However, to make a proper comparison we will have to address some technical issues regarding 
statistical estimation in general that were raised at the Workshop on Non-Gaussianity in Cosmology 
held in Trieste in July 2006. 

For example, some of the comments and talks at this workshop, as well as in published papers, 
seem to give the message that the Cramer-Rao (CR) inequality is an ideal way to decide the optimality 
of estimators; that this inequality can be used to decide if an estimator provides all the available 
information about an unknown parameter, and that one should strive to find unbiased estimators of 
minimum variance. However, none of these statements is quite correct. We will show this with simple 
examples as well as with the problem of estimating /nl to test CMB non-Gaussianity. 

In order not to keep repeating the same references for the different statistical concepts mentioned 
in this note (in italics), we refer the reader to two excellent general references where all the definitions 
can be found: Lchmann & Casella [7] and Casclla & Berger 

We start by recalling the scalar version of the CR inequality: It states that under 'regularity' 
conditions the variance of an unbiased estimator 9(Y) of 9 satisfies 

Var(^))>^, 

where the Fisher information J- (9) is defined by 

T{9) = Var ( d/d9logL{9; Y) ) 
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and L(9; Y) is the likelihood function. The first problem when trying to apply this bound to estimates 
of /nl is in the regularity conditions; one of the conditions is to be able to pass the derivative d/d6 
of the likelihood function under the integral J dy. We will see that this cannot be done for the /nl 
problem. However, even if the regularity conditions were satisfied, we would still have to determine if 
there are unbiased estimators of /nl- The answer again is negative. Finally, if the conditions were met 
and there were unbiased estimators matching the bound, it still would not follow that the estimators 
would be any good. 

But before we consider these problems, we believe it is important to start with two simple examples 
that can be completely worked out and that illustrate some of the basic issues we will address: 

Example. Consider the uniform distribution on the interval (0,29). The goal is to estimate 9. The 
probability density function (PDF) is 

where I a stands for the indicator function of the set A. The likelihood function of 9 is 



The first two derivatives of the log-likelihood are: 

did 2 1 
— \ogL(9;x) = jjI (x/ 2 tO0) [6), \ogL{9;x) = -— I {x/2iOo) (0). 

First, note that the derivative 3/ 89 cannot be passed under the integral sign: 
r °° 8 f 2e 1 19 



This means that the conditions required for the CR inequality are not satisfied and therefore it is not 
necessarily valid. In fact, since the likelihood function is constant in X with probability 1, it follows 
that T(6) = 0. Or, formally, 



F(0) = Var ( d/d9 log L(6; X) ) 



= E 



( 8/ 89 log L(9; X)f - [ E ( 8/89 log L(6; X))] 
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thus the CR bound is infinite. Yet, 9(X) = X is an unbiased estimator of 9 with finite variance 9 2 /3. 
The ML estimator is 9 Mh (X) = X/2. This estimator is biased but its mean square error (MSE) is 
smaller than that of the unbiased estimator: 



MSE 



( 9 ML (X) ) = Var ( 9 ML ( X) ) + Bias ( 9 ML (X) ) ' = ^ 



l)2 



< — = MSE 9{X) 



More generally, for a sample of size n, the sample mean 9 n [X\, ...,X n ) = X = (X\ + ■ ■ ■ + X n )/n is a 
consistent estimator of 9 (i.e., it is unbiased and its variance decreases to as n — > oo): 



E 



(9 n {X))=9, Var(0 n (X)) = 



9 2 . 

6 n 
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It is also easy to see that the ML estimator is 9^ L — X( n \/2, where -X"( n ) is the largest of the Xj. 
The bias and variance of this estimator go to zero as the sample size increases 



E(^ L (X))=-^, Var(^ L W 



(n + 2)(n+l) 2 



which shows that the ML estimate is also consistent; the requirements for consistency of ML estimates 
are different from those of the CR inequality. The variables are usually called order statistics and 
make their appearance here because the conditions X\ < 29,...,X n < 29 are equivalent to the single 
requirement X(„) < 29. 

The estimator 9 n is known as a moment estimator because it is obtained by first solving for 9 as 
a function of the moments of X and then substituting the sample moments for the population ones 
(correlation functions, and thus the bispectrum, are examples of moment estimators). It is well known 
that moment estimators can usually be improved by conditioning on what are called sufficient statistics 
(functions that contain all the information about 9 that is provided by the data). Moment estimators 
are usually not given in terms of sufficient statistics while ML estimators are. For example, to improve 
9 n we condition on the sufficient statistic X(„) and obtain the estimator 

0£(X)=E(MX) |X (n) ) = ^±lx (n) . 

This estimator is unbiased and its variance is smaller than that of the original 9 n : 



Var 



This confirms that the moment estimate did not contain all the information about 9. 
Finally, recall that under regularity conditions 3^(9) can also be computed as 

T{9) = -E ( d 2 /d9 2 log L(6; X) ) 

but not for our simple example for the same reason the CR inequality is not valid: we have 

T{9) = -E ( d 2 /d9 2 log L{6; X) ) = 1 /9 2 ^ 0. 

Example. We have seen that the CR inequality docs not have to hold if the support of the PDF 
depends on the parameter. But, could it hold for values of 9 for which the support is 'essentially' 
(0, oo)? This is not necessarily even in the limit is a finite interval. Here is an example: We make a 
slight change to the PDF in the previous example and use the uniform U(0, 1 + 6*). The support of the 
PDF still depends on 9 and so the CR inequality is not valid. But suppose we are only interested in 
9 <C 1. In the limit as 9 — ► 0, the support becomes (0, 1). Hence, it may be tempting to conclude that 
the CR inequality is 'approximately' valid for small 9; it is not. Just as before, the Fisher information 
is F{9) = and the bound is again infinite. And, just as before, there are unbiased estimators (e.g., 
9 n (Y) = 2X — 1) of finite variance. The problem of estimating /nl is even worse because the likelihood 
function blows up at the boundaries. 



The lessons to learn from these simple examples are: (i) The CR bound does not have to be valid 
if the support of the PDF depends on the parameters to be estimated; (ii) The CR bound is not always 
applicable and even when the bound is infinite, there may be consistent unbiased estimators of finite 
variance; (iii) an unbiased estimator is not necessarily better than a biased one. In fact, there are 
many cases where no unbiased estimators exist, or where a biased estimator is better than the best 
unbiased estimator. Furthermore, although not shown in this example, there are many examples where 
an unbiased estimator has a variance that matches the CR bound and yet its MSE is larger than that 
of a biased estimator; (iv) moment estimators (such as the bispectrum) do not usually contain all the 
information about the unknown parameter that the data provide and therefore can be improved by 
conditioning on sufficient statistics. 
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We will show that the problem of estimating /nl is similar to that of estimating 6 in the examples 
above but it is slightly worse in that the likelihood function is unbounded, it has more than one 
maximizer and there are no finite variance unbiased estimators of /nl- 

The rest of the note is organized as follows: In Section |3 we start with a particular case of the 
local model to show that the conditions for the CR inequality of /nl are not satisfied. We 

determine some of the particular characteristics of the likelihood function and derive simple expressions 
for estimates based on maximum likelihood. In Section [3] we return to the general local model to 
extend the results found for the simple model. Again we find simple expressions for the ML estimates 
but find that neither the moment nor the ML estimators are better for all values of <b. We then make 
shrinkage modifications to the estimators to improve their performance for small values of /nl- Some 
technical details are left to the Appendix. 

2. The toy case. We now return to the local model l|l.lfl and start with a particular case described 
in Section III of Babich 1 . This model is based on independent and identically distributed observations 



Y^^ + h^Uf-a 2 ), (2.1) 

where the Ui arc independent A(0, a 2 ). For simplicity, we will write <fi instead of /nl- 

2.1. PDF and likelihood function. To derive the PDF and study the CR bound, it is enough 
to consider the case of a single Y = U + 4>{U 2 — a 2 ), which can be written as 



Y = 4>(j 2 V - [ — + (ha 2 



1 



where 



2cj)a 

By definition V has a noncentral \\ distribution with noncentrality parameter A = 1/(4<t 2 </> 2 ) and thus 
its PDF is 

1 e -("+*)/2 , x 

p(V, A) = — = = — cosh VXv I v>0 . 

V27T \/v \ J ~ 

The PDF of Y can be easily derived as Y is just a rescaled and shifted V. The problem is the condition 
V > 0; this boundary makes the support of the PDF of Y depend on the unknown parameter <p. The 
PDF can be written as 



p(y,4>) =p {yA){y)h<o +p + (y»(y)^>o, 



where 



p {y,4>){v) = t= — = - r , = „ =j=, cosh — 2 | , | 3/2 J A± 

\/27rcr 2 y/\y + (j)cr 2 + l/4<j)\ \ 2<r 2 \(b\ 3 / 2 J 

with the sets A ± defined as A + = { y > -cp a 2 - 1 /4c/> } and A~ = { y < —(j) a 2 - 1 /4<j> }. 

In particular, for <f> > the support in y of p(y, cf>) has a left boundary defined by A + , which depends 
on <j). This is a problem for the CR inequality because the derivative d/d(j> cannot be passed under the 
integral J dy. In fact, having a PDF whose support does not depend on the unknown parameters is 
often stated as a requirement for the CR inequality (e.g., Bickel & Doksum 0). This is an important 
point that cannot be overlooked. 

In our toy problem the support of p(y, cf>) not only depends on (b but the likelihood function also 
diverges to infinity at the boundaries of A ± . However, one can still define reasonably good ML-based 
estimates. In fact, we will show below that the boundaries actually lead to simple estimates that in the 
general case depend only on a 2 and not on the full covariance matrix. 
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Fig. 2.1. Likelihood function 12.21 for y = 1/2. 



2.2. Maximum likelihood estimates. The shape of the likelihood function depends on the value 
of y. On the region <f> > the likelihood is not zero only if y is such that cj> belongs to A + ; on <j> < 
we need a y such that <j> belongs to A~ . It is easy to see that when | y | < a (or when | y \ < 1 if we 
normalize the data to estimate 4>cr), the likelihood function is just 

L(<f>,y) =p + {yA)h>Q+p~{y^)h<o- (2-2) 

This function has a unique maximum and the ML estimate is well defined. For example, Figure 12. f I 
show the likelihood function of <pa for y = 1/2. 

The case | y \ > a is more pathological. For > and y < —a, we need 

0< 2a 2 ° r 0> 2^ ' 

which delimits the boundary of A + . Since the likelihood function blows up at the two limits, it is 
maximized by setting (j> to be either of the limits. Hence the maximum likelihood estimate is not well 
defined. A similar thing happens for (f> < and y > a . However, this does not mean that neither of 
the two possibilities is good. In fact, we get a reasonably good estimate by averaging them: Define the 
ML-based estimate of (j> as 

\Y\/2a 2 if Y<-a 

c^ ML ( Y ) = { maximizcr of (E3 i£\Y\<a 
~\Y\/2a 2 ify>cr. 

It may seem strange that the estimate is positive when Y is negative and conversely (it also happens 
in Figure |2~T|) but it can be explained heuristically as follows: If Y < — cr, ten U + 4>{U 2 — a 2 ) < — a, 
that is, 

(j)(U + a)(U-a)<-(a + U). (2.3) 

But since U is Gaussian N(0,a 2 ), it is in the interval — a < U < a with probability 68%. Hence, about 
70% of the time we expect a + U > and U — a < and thus the right hand side in (|2.3|) is negative 
and for the left side to have the same sign we need <j> > 0. Similarly, we need < when Y > a. 
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A better estimate is obtained by taking in each case the quadratic root closest to zero 



\Y\-(Y 2 -a 2 ) 1/2 )/2a 2 if Y < -a 

2 ML ( Y ) = \ maximizcr of if | Y \ < a 

\Y\ + (Y 2 -a 2 ) 1/2 ) /2a 2 if Y > a. 

In the section below we compare the MSE of these two estimators to that of some moment estimates. 

2.3. Moment estimates. We now compare ML to moment estimators of <\>. To make our point, 
it is again sufficient to consider the scalar case of a single Y. 

We know that the odd moments of a Gaussian variable U ~ N(0,a 2 ) arc zero while the even 
moments are given by 

T7ir7-2ro (2m)! 2m r < ~2m 
2 m m\ 



It follows that 



J-EY 2m = E ^,2k (^) 2fc (2.4) 

2fc<2ra 

L_ E y2 m+ i = E 2m+1>2k+1 (<j>af k+1 , (2.5) 



a 2m+l 

2fe+l<2m+l 

where the sums are over even and odd indices, respectively, and 

As we have assumed that a is known, we focus on estimators of the dimcnsionlcss parameter (pa using 
the normalized random variable Y/a. For example, the first two estimators based on the third and fifth 
moments are: 

ftV (Y/a) = J- (Y/af, ^ (Y/a) = J- (Y/a) 5 . (2.6) 

-^3,1 -^5,1 

The question we want to address now is whether <p^ is better than any of the other moment estimators. 
We use the MSE to compare (p^\ <f>^ and the ML estimator. The variance and bias of (p^ and <p^ are 
easily derived using (|2.4|) and (|2.5|) . The MSE of the ML estimates is calculated through simulations. 

Figure E21 shows the MSE of (3) , <^ (3) and the two ML estimators X ML and <^ 2 ML . We see that 
</>( 3 ) is better than <p^ except for small values | a<p | < 0.05. For values | a<p \ < 0.02, (f>^ is better than 
the first ML based estimate (p^. The ML estimate has much smaller MSE than the moment 
estimates. However, we will see that the general case is not as clear cut. 

2.4. Non-existence of unbiased estimators of (p. In the simple example discussed in the 
introduction, the CR bound is infinite but there are finite variance unbiased estimators of the unknown 
parameter. However, unbiased estimators do not always exist and our toy model provides one such 
example. 

Suppose there is a finite variance unbiased estimator <p{Y) of <p based on Y . That is, 

E($(Y)) = <t>, Var(0(Y))<oo 
for all <p £ IR and any a > 0. Then, by dividing each Yi in (|2.1(l by a constant a^Owc obtain 



Yi = Vi + $ (lJ 2 -a 2 ), 
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Fig. 2.2. Mean square error (MSE) of the moment and ML estimators of oij> for the toy model 12.11 . 

where Yi — Yi/a, a = a/\ a | and C/j = Ui/a ~ N(Q,a 2 ). But since <fi(Y) is an unbiased estimator, we 
must have E y 4>{Y) ^ = <p and therefore 

1 ' /¥ ' (2.7) 



*.(*■) 

is an unbiased estimator of <f> for any a^O. This implies that 

(y)p(ayi,4>) ■ ■ -p(ay n , (f>) dy 



for any a/0. Hence, the inverse of the left hand side has to be a monomial in a of degree n — 1. In 
particular, it is a constant for n = 1. We consider the case n = 1 in the appendix and show that the 
condition is impossible to meet. Hence, there arc no unbiased estimators of finite variance based on Y. 
Of course, one would still have to check if there arc unbiased estimators based on more that one Y^ (the 
number of such Yi is called the degree of the estimator) . 

3. The general case. We now return to the general case l]l.ip . As in the scalar case, a simple 
transformation leads to 



Y 



1 



with 



v ={ u+ h)-*{ u+ h 

where stands for point- wise multiplication, and 1 = (1, 1). The PDF of V is of the form 

p(V,<t>) =F(V,<j>)I Vl >o---Iv n >o, 
for some 'nice' exponential function F(V, <j>) that will be defined below. It follows that the PDF of Y 



F 



Y 



> 



Y n +l^ 



> 



(3.1) 
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Fig. 3.1. The left panel shows the probability that either Yn) < —1 or Y^ n j > 1 as a function of cr<f> for independent 
samples of sizes n = 10 and n = 40 of the local model 11.11 . The right panel shows the probability that Y(i) < —1 and 
Y {n) > 1. 

where = 1/40 + a 2 4>. 

Just as in the scalar case, the support of the PDF of Y depends on <j>; it is defined by the two sets 

B + = { Y (1) > -£+ } , B~ = { Y [n) <-!*}, 

where Y(i) and Y( n ) denote the smallest and largest values of the sample. Hence, the CR bound is again 
not applicable but just as in the toy local model, there are simple expressions for estimates of 4> based 
on maximum likelihood. 

3.1. Maximum likelihood estimates. To find ML estimates of <fi in the general case, we first 
need to derive the function -F(V, 4>) mentioned in the previous section. This is easy to do by computing 
the probability P [ V\ < V\, V n < v n ]. The final result is: 



F(V,<i>)<x 



G(V, 0,£), 



G(v,</>,n) = g(^ + (-i) fel ^T,-,^ + (-i) fcl ^s; 

fci,...,fc„=0 



.9(«,S) 



1 



|£|™/ 2 

To be consistent with Section ITT1 write the likelihood function of <fi as 

L(cf>; Y, £) = p- (Y, 0, £)/ 0<o + p+ (Y, <f>, £)/ 0>o 

with 



1 



1 



G 



,S I B ±. 
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As in the scalar case, to find the maximum likelihood estimates we need to consider different cases. For 
Y(i) < —a the likelihood blows up at the roots of a 2 (ft 2 + 4>Y(i) + 1/4 = 0, and there are two values of <f> for 
which this happens. A similar thing happens with Y( n ) > a and the two roots of a 2 4> 2 +<pYi n \ + 1/4 = 0. 
What is good about these cases is that they do not require the use of the full covariance matrix X, only 
the variance a 2 is needed. The covariance matrix is needed only when | Yr-i) | and | Yj- n ) | are both less 
than a. In this case the likelihood remains finite and has to be maximized. 

However, in a large sample one expects Yru to be small and Y(„) to be large. Thus the case 
I I) I Y(n) I < & should be rare in large samples. On the other hand, depending on <fi, it is not always 
possible to have Y < — a. For example, it cannot happen for <fi = l/2er. However, with high probability 
Y(i) < — c or Y(n) — a an( l m cither case the ML estimates are roots of the quadratic equations that 
only require a 2 . 

For example, the left panel in Figure I5TT1 shows the probability that Y^ < —a or Yr n \ > a based 
on independent samples of size n = 10 and n = 40. We see that with only n = 10, the probability is at 
least 82% that either Ym < — a or Yr n ) > cr . With n = 40 the probability is very close to one. The 
right panel shows the probability that both cases will happen. We see that for large samples the two 
cases do not happen together only when <j<fi ~ ±1/2. But of course, in the general case the samples 
Yi are correlated because U has a full covariance matrix but for sky maps with thousands of pixels we 
expect more than 40 'degrees of freedom' so we still expect the same behavior of small and large values 
of Y(!) and Yr n \, respectively. 

The selection of the appropriate quadratic roots depends on the range of values of cr<f) under con- 
sideration. However, a simple geometric argument can be used to select them: if <fi > 0, then, as a 
function of U, the parabola U + <f>(U 2 — a 2 ) has a minimum at U = —1/20 with minimum Y value 

Y^ = ---a 2 ^ 

and therefore is a solution of a 2 <f> 2 + <fi Y m ; n + 1/4 = 0. There are two solutions of this equation, 
depending on whether 0>l/2crorO<0< l/2er. These solutions are functions of Y m i n . We have 
a similar situation when <f> < 0, in which case we find <fi as a function of Y max . Hence, we have the 
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4 6 8 1.5 2 2.5 3 



order=2k+1 log(n) 

Fig. 3.3. The left panel shows the limit, as <f> — > 0, of the MSE of moment estimators (f>( 2k + 1 ) as a function of order 
2fc + 1. It shows that </>( 5 ' has the smallest MSE near <f> = 0. The right panel shows the decrease with sample size n of the 
MSE of Y( n ) as an estimator of Y ma x ( as defined in the text ) . It explains the decrease of the MSE of the ML estimates 
in Figure IxH 



following ML estimator 



; ML 



= < 



where the roots are defined as 



Y(i) I ± 



y2 
r (l) 



r_ 

±1/2(7 

rt 
„- 



1/2 



0<(f>< 1/2(7 
<j> > l/2a 

<j> = ±1/2(7 
-1/2(7 < < 

0<-l/2, 



(3.2) 



Yi 



(n) 



l±(^)-- 2 ) 



1/2 



2a 2 



2a 2 



For example, let us return to the case where the Yi are independent. Define the third and fifth moment 
estimates as in H2.6H but averaging over the n observations. Figure 13.21 shows the MSE of the ML 
and moment estimates for n = 50 (left panel) and n = 800 (right panel). We make the following 
observations: (i) The fifth and third moment estimates, especially the former, provide better estimates 
for values | aqb \ < 0.05 but otherwise the ML estimator has much smaller MSE; (ii) The decrease in 
the MSE with increasing sample size is much better for the ML estimators. This decrease is hardly 
noticeable in the moment estimates for larger values of a<p where the estimates arc dominated by the 
bias. 

Near (f> = the smallest MSE is that of (5) because as -> 0, the MSE of 0( 2fe+1 ) is 



MSE( 



/T(2fc+1) 



<p=0 



Eik+2,0 
n ^2fc+l,l 



The left panel in Figure 13.31 shows the plot of this MSE (for fixed n) as a function of moment order 
2k + 1; the minimum value is at 2k + 1 = 5. Thus, for small <fi, is better than <f>( 3 > and the other 
higher order moment estimators. 

The large MSE of <p ML near the origin comes from ML providing estimates that are, on the average, 
greater than acj) (in absolute value). From the Appendix, we know that there are problems near the 
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origin. It is also clear that as <fi ~~ > the parabola, used to choose the roots, becomes a line and the 
problem is ill-posed. It should be possible to improve ML estimates near the origin using shrinkage 
estimators or penalized likelihood. We return to this below. 

The fast decrease in MSE of the ML estimate is interesting and can be explained as follows. Esti- 
mating (p reduces to estimating l^in (for <p > 0) and Y max (for <p < 0) with Ym and Y/ n -\, respectively. 
The convergence of the corresponding estimates is then explained by the convergence of Y^ to Y min 
and Y( n ) to Y max , respectively. For example, the right panel in Figure shows the MSE of Yr n \ as an 
estimator of Y max as a function of n for a<p = — 1. We see the same decrease (as 1/n 4 ) in MSE as in 
Figure 13.21 It is not difficult to study the convergence rate and asymptotic distributions of Yr n \ and 
Y(i) using extreme value theory (e.g., Galambos 

3.1.1. Performance for small <fr. We now focus on small values of <j> to determine if any of the 
estimators discussed above is better than the simple zero estimator: <t>o(Y) = . Figure IPl shows 
the results for | acfr | < 0.03 and sample sizes n = 50 (left) and n = 800 (right). This time we show a 
relative measure of the error defined by V(<f>) = MSE 1/2 /| $ \ (<t> ^ 0). Estimators that are better than 
<po should have V < 1. The figure shows that with n = 50 neither of the three estimators improves on 
the zero estimator. For the larger sample, (j)^ and cf>^ are an improvement but slight and only for 
| o~cj) | > 0.02. However, this is not the best we can do. Our estimators can be improved near the origin 
using ideas based on shrinkage estimators. 

For example, to estimate a mean vector m with observations Y — m + noise, one can minimize 
| Y — rh || 2 + A || m || 2 (where A > is fixed) over rh and end up with rh = Y/(l + A). This estimator 
'shrinks' toward the origin and it is the result of minimizing a goodness of fit norm that 'penalizes' 
vectors of large norms. This is a typical procedure used to obtain solution of ill-posed inverse problems 
(e.g., O'Sullivan [Jj], Tenorio |10|). Another example is the well-known James-Stein shrinkage estimator 
of a multivariate Gaussian mean, which improves on the minimum variance unbiased estimator. 

The idea is then to define estimators of the form 4> s h(Y) = (1 — \(Y, n)) 4>(Y), where X(Y, n) takes 
values in the interval (0, 1). We get the estimators </> and 4>q, respectively, in the limits as A — ► and 
A — ► 1. The problem is then to select a good shrinkage factor. In our case, even the simple choice of a 
constant factor A already leads to improved estimates. We have chosen the factor that gives the best 
results for each estimator. The dashed lines in Figure 14.11 show the value of V for shrinkage versions 
of 4>^ and 0> ML . For the small sample, only the shrunk ML improves on the zero estimator and only 
for | a<j> | > 0.017. For the larger sample, the best results are those of the shrunk tp^ and (f> ML . Both 
improve on </>□ for | acj) | > 0.011 but ML leads to a much smaller MSE. One should be able to get even 
better results with more sophisticated choices of X(Y, n). 

4. Summary. We have compared higher order moments to maximum likelihood (ML) estimators 
of /nl an d found that no estimator is better than the others over all values of <p. The selection of an 
appropriate estimator will depend on the relevant range of values of <p. For very small values, the fifth 
moment leads to the estimator with smallest mean square error (MSE), followed by the third moment 
and ML. But we have also shown that even for small /nl! the fifth moment is not the best as all the 
estimators considered here improve with the introduction of a shrinkage factor. In practical applications, 
the difference between the estimators will depend on the size of the sample and the covariance structure 
that determines its 'effective degrees of freedom'. The selection of shrinkage factors will also depend on 
the particular covariance structure and sample size. 

We found that away from the origin the ML estimate has the smallest MSE. However, while we 
have found a very fast decay of the MSE with increasing sample size, the asymptotics considered 
here were for the case of independent observations. A correct asymptotic analysis should study the 
in-field asymptotics; an excellent example with applications to cosmology is Marinucci For finite 
samples, the performance of the estimators should be studied for the particular covariance matrices of 
the cosmological random fields under consideration. In addition, a more careful study should determine 
the robustness of the estimators against deviations from the local model coming from higher order 
terms. 
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We have compared estimators mainly through the MSE and its decrease with increasing sample 
size. We do not think that the Cramer-Rao (CR) inequality is the appropriate tool to select optimal 
estimators: matching the CR bound is not enough to guarantee a good estimator. For example, 
shrinkage estimators arc not found by matching a CR bound. 

We are aware of a certain bias in favor of unbiased estimators in the community. However, all the 
estimators we have considered here are biased. I hope this note helps us remember that: Unbiased 
estimators do not have to exist. When they exist, they may be difficult to find. If we find one, it may 
not be very good. 
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Appendix A. Proof of the non-existence of unbiased estimators. 

We assume that (f> is a finite variance unbiased estimator of <j> based on Y and that <j> a , as defined 
in 12. 7|) . is an unbiased estimator of <j> for any a > and reach a contradiction. This will show that 
there are no finite variance unbiased estimators of (f> based on Y. 
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It is enough to consider > 0. To simplify notation, we define the following functions 

G(a, y) = e -(^+V40)/(2^) cosh ( _J_ /^^T^) 

^ay + lj, 

With this notation 



E MY) = \/^ZJ^ I H(a,y)0(y)dy. (A.l) 

t$/ot 



27T0CT 2 

Result 1. For any y > 0, H(a,y) is a decreasing function of a > and 

lim H(a,y) = 0. 

a — >oo 

Proof: Define 

h(a, y) = ay + 1$ = ay + (pa 2 + — > 

The we can write 



Clearly, the first factor decreases as a increases, as does the second factor when multiplied by the 
decreasing exponential of the hyperbolic cosine. All we have left to show is that 



H(a,y) = e -C'(«.i/)+i/40)/(2<T 2 0) e ^("^)/(2^ 2 * 3/2 )^ 
is also decreasing. To show this, it is enough to show that the following difference decreases 
D(a, y) = yjh{a,y) - ^ ( h(a, y) + 1/40 

= i (y/<f>h(a,y)-(<l>h{a,y) + l/4)) < 0. 
But since <j>h(a,y) > 1/4, we have 



Hence D(a,y) is also a decreasing function of a and so is H(a,y). Finally, since the first exponential 
in the definition of H(a, y) has a negative exponent and D{a, y) < as shown above, it follows that 
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as a — > 00. 



Result 2. 



Proof: Since 



lini 



H(a,y)cj>(y) dy = 



H(a,y) 4>{y)dy 



< 



H(a,y) | <j>(y) \ dy, 



we show that the right hand side goes to zero as a increases. By the last result, H(a, y) is a decreasing 
function of a and therefore 

H(a,y)\$(y)\ < H(l,y)\$(y)\ 

for any a > 1 and y > 0. But the right hand side has a finite integral over [0, oo) because 4>{Y) is a 
finite variance estimator. Hence, by Lebesgue's dominated convergence theorem 



lim 



H(a, y) | (j>(y) \dy = lim H(a, y) \ <j>{y) \dy = 0. 



Result 3 (a). If <f>{y) is bounded in a neighborhood of zero, then 



lim 



(■4,1a 



H(a,y)(j){y)dy = 



Proof: First note that for any y <G (— l^/a, 0) and a > 1 
G{a,y) _ 



a (a-l)/(2<T 2 0) e +e v 



< 1^/(2^^+7^ ( e 27^I + 1 ) = A < OO, 

where 7 = 1/(2ct 2 (/> 3 / 2 ). But since G(l,y) is bounded on (—l,p/a,Q), there are constants and -B^ 
such that 

O<^<G(a,y)<5 
for all a > 1 and y € (— £^/a,0). It follows that 



ff(a,j/)|0(j,)|dy< B 



dy 



Since </>(y) is bounded in a neighborhood of zero, there is a constant C > such that for a large enough 



<7;y 



c 



as ct — > 00 . 

Result 3 (b) If near the origin <f>{y) behaves as 1/| y \P for some /3 G (0, 1), then 



lim 

a—* 00 



H{a,y)(j){y)dy = 
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Proof: For a large enough we have the integral 
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*y f * d v 



-1*1* I v f V a y + U J° y 13 - y 

l * /2 dy [ e * dy 



< a 



/3-1 



y 13 - y J1+/2 y p \Jh~v 

1 t^ 12 dy ( 2 Y f l * d y 



\ yj% Jo V \ W J l^/i y/U-y 



as a — > oo for [3 € (0, 1). 



Remark: The condition that (f> behaves as l/ \ y \@ near the origin is not very restrictive for the following 
reason. We have just shown that E( l/\y\P) < oo for any [3 <E (0, 1) and since <fr has finite variance, it 
follows from Cauchy-Schwarz inequality that 



\y\p I ~ y V \ Y \ 213 

It follows from this inequality that 



dy < oo, 



which already provides enough information to arrive at the same conclusion of Result 3(b) using a 
Mcllin transform analysis (e.g., Bleistcin & Handclsman 

Result 4. 

lim E (j) a (Y) = 0^0, 

a — >oc 

which contradicts the fact that (j) a (Y) is an unbiased estimator of <j> for any a^O. Proof: We split 
the integral in (|A.lj) into two integrals with ranges (—£$/a, 0) and (0, oo) and then apply Results 2 and 
3. 



